Efficient motion-vector prediction for unconstrained and lifting-based motion compensated temporal filtering

ABSTRACT

Video coding method and device for reducing the number of motion vector bits, the method and device differentially coding the motion vectors at each temporal decomposition level by predicting the motion vectors temporally and coding the differences.

This application claims the benefit under 35 USC 119(e) of U.S. provisional application Ser. No. 60/416,592, filed on Oct. 7, 2002, which is incorporated herein by reference.

The present invention relates generally to video coding, and more particularly, to wavelet based coding utilizing differential motion vector coding in unconstrained and lifting-based motion compensated temporal filtering.

Unconstrained motion compensated temporal filtering (UMCTF) and lifting-based motion compensated temporal filtering (MCTF) are used for motion-compensated wavelet coding. These MCTF schemes use similar motion compensation techniques, e.g. bi-directional filtering, multiple reference frames etc., to eliminate the temporal correlation in the video. Both UMCTF and lifting-based MCTF outperform uni-directional MCTF schemes.

In providing good temporal decorrelation, UMCTF and lifting-based MCTF have the disadvantage of requiring the transmission of additional motion vectors (MVs), which all need to be encoded. This is demonstrated in FIG. 1, which shows an example of UMCTF without multiple reference frames, but with only bi-directional filtering. As can be seen, the MVs in each of the temporal decomposition levels (MV1 and MV2 in level 0 and MV3 in level 1) are independently estimated and encoded. Since bi-directional motion estimation is performed at multiple temporal decomposition levels, the number of additional MV bits increases with the number of decomposition levels. Similarly, the larger the number of reference frames used during temporal filtering, the greater the number of MVs that need to be transmitted. Compared to a hybrid video coding scheme or to a Haar temporal decomposition, the number of MV fields is almost double. This can negatively affect the efficiency of UMCTF and lifting-based MCTF for bi-directional motion-compensated wavelet coding at low transmission bit-rates.

Accordingly, a method is needed which reduces the number of bits spent for coding MVs in an unconstrained or lifting-based MCTF scheme.

The present invention is directed to methods and devices for coding video in a manner that reduces the number of motion vector bits. According to the present invention, the motion vectors are differentially coded at each temporal decomposition level by predicting the motion vectors temporally and coding the differences.

FIG. 1 shows an example of UMCTF without multiple reference frames, but with only bi-directional filtering.

FIG. 2 shows an embodiment of an encoder which may be used for implementing the principles of the present invention.

FIG. 3 shows an exemplary GOF which considers three motion vectors, at two different temporal decomposition levels.

FIG. 4 is a flow chart showing a top down prediction and coding embodiment of the method of the present invention.

FIGS. 5A, 5B, 6A, 6B, and 7 show results for two different video sequences using the top down prediction and coding embodiment of the method of the present invention.

FIG. 8 shows an example of top down prediction during motion estimation.

FIG. 9 shows results for two different video sequences using the top down prediction during motion estimation.

FIG. 10 is a flow chart showing a bottom up prediction and coding embodiment of the method of the present invention.

FIGS. 11A, 11B, 12A, 12B, and 13 show results for two different video sequences using the bottom up prediction and coding embodiment of the method of the present invention.

FIG. 14 shows results for two different video sequences using the bottom up prediction during motion estimation.

FIG. 15 shows motion vector bits for frames within a group of frames using the bottom up prediction during motion estimation.

FIG. 16 shows two levels of bi-directional MCTF with lifting.

FIG. 17 shows a mixed, hybrid prediction and coding embodiment of the method of the present invention.

FIG. 18 shows an embodiment of a decoder which may be used for implementing the principles of the present invention.

FIG. 19 shows an embodiment of a system in which the principles of the present invention may be implemented.

The present invention is a differential motion vector coding method, which reduces the number of bits needed for encoding motion vectors (MVs) generated during unconstrained and lifting-based motion compensated temporal filtering for bi-directional motion-compensated wavelet coding. The method encodes the MVs differentially at the various temporal levels. This is generally accomplished by temporally predicting the MVs and encoding the differences using any conventional encoding scheme.

FIG. 2 shows an embodiment of an encoder which may be used for implementing the principles of the present invention, denoted by numeral 100. The encoder 100 includes a partitioning unit 120 for dividing an input video into groups of frames (GOFs), each of which is encoded as a unit. An unconstrained or lifting-based MCTF unit 130 is included that has a motion estimation unit 132 and a temporal filtering unit 134. The motion estimation unit 132 performs bi-directional motion estimation or prediction on the frames in each GOF according to the method of the present invention, as will be explained in detail further on. The temporal filtering unit 134 removes temporal redundancies between the frames of each GOF according to the motion vectors MV and frame numbers provided by the motion estimation unit 132. A spatial decomposition unit 140 is included to reduce the spatial redundancies in the frames provided by the MCTF unit 130. During operation, the frames received from the MCTF unit 130 may be spatially transformed by the spatial decomposition unit 140 into wavelet coefficients according to a 2D wavelet transform. There are many different types of known filters and implementations of the wavelet transform. A significance encoding unit 150 is included to encode the output of the spatial decomposition unit 140 according to significance information, such as the magnitude of the wavelet coefficients, where larger coefficients are more significant than smaller coefficients. An entropy encoding unit 160 is included to produce the output bit-stream. The entropy encoding unit 160 entropy encodes the wavelet coefficients into an output bit-stream. The entropy encoding unit 160 also entropy encodes the MVs and frame numbers provided by the motion estimation unit 132 according to the method of the present invention, as will be explained in detail further on. This information is included in the output bit-stream in order to enable decoding. Examples of a suitable entropy encoding technique include without limitation arithmetic encoding and variable length encoding.

The differential motion vector encoding method will now be described with reference to the GOF of FIG. 3, which for simplicity of description only, considers three motion vectors, at two different temporal decomposition levels, which may be called level 0 and level 1. MV1 and MV2 are the bi-directional motion vectors connecting an H-frame (the middle frame) to a previous A-frame (the left A-frame) and a succeeding A-frame (the right A-frame) at temporal decomposition level 0. After filtering at this temporal decomposition level, the A-frames are then filtered at the next temporal decomposition level, i.e., level 1, wherein MV3 corresponds to the motion vector connecting these two frames.

In accordance with a top down prediction and coding embodiment of the method of the present invention, the steps of which are shown in the flow chart of FIG. 4, the MVs at level 0 are used to predict the MVs at level 1 and so on. Using the simplified example of FIG. 3, step 200 includes determining MV1 and MV2. MV1 and MV2 may be determined conventionally by the motion estimation unit 132, at level 0 during motion estimation. During motion estimation, groups of pixels or regions in the H-frame are matched with similar groups of pixels or regions in the previous A-frame to obtain MV1, and groups of pixels or regions in the H-frame are matched with similar groups of pixels or regions in the succeeding A-frame to obtain MV2. In step 210, MV3 is estimated or predicted for level 1 as a refinement based on MV1 and MV2. The estimation for MV3 is an estimation of the groups of pixels or regions in the succeeding A-frame from level 0, which match similar groups of pixels or regions in the previous A-frame from level 0. The estimation or prediction of MV3 may be obtained by calculating the difference between MV1 and MV2. In step 220, the entropy encoding unit 160 (FIG. 2) entropy encodes MV1 and MV2. The method may end here or optionally in step 230, the entropy encoding unit 160 may also encode the refinement for MV3.
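The top down steps 200 to 230 can be illustrated with a minimal sketch. It is not the implementation of the encoder 100; it assumes that each MV is a per-block integer pair (dx, dy), that MV1 and MV2 point from the H-frame to the previous and following A-frames, and that MV3 points from the following A-frame to the previous A-frame, so that the predictor is MV1 - MV2. The function names are hypothetical.

```python
def predict_mv3(mv1, mv2):
    """Top down predictor for MV3: the difference MV1 - MV2 (sign convention assumed)."""
    return (mv1[0] - mv2[0], mv1[1] - mv2[1])


def mv3_refinement(mv3, mv1, mv2):
    """Refinement (residual) that would be entropy coded in place of MV3 itself."""
    px, py = predict_mv3(mv1, mv2)
    return (mv3[0] - px, mv3[1] - py)


def reconstruct_mv3(refinement, mv1, mv2):
    """Decoder side: add the decoded refinement back onto the same predictor."""
    px, py = predict_mv3(mv1, mv2)
    return (refinement[0] + px, refinement[1] + py)


# Illustrative values only (not taken from the experiments).
mv1 = (-3, 1)   # H-frame -> previous A-frame (level 0)
mv2 = (3, -1)   # H-frame -> following A-frame (level 0)
mv3 = (-7, 2)   # following A-frame -> previous A-frame (level 1)

refinement = mv3_refinement(mv3, mv1, mv2)
restored = reconstruct_mv3(refinement, mv1, mv2)
```

When the motion is temporally correlated, the refinement is small compared with MV3 itself, which is where the bit savings would come from under these assumptions.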

Since MV1 and MV2 are likely to be accurate (due to the smaller distance between the frames), the prediction for MV3 is likely to be good, thereby leading to increased coding efficiency. Results for two different video sequences are shown in FIGS. 5A, 5B, 6A, and 6B. Both sequences are QCIF at 30 Hz. A GOF size of 16 frames, a four level temporal decomposition, a fixed block size of 16×16, and a search range of ±64 were used in these examples. The results present the forward and backward MVs separately, and are shown across the different GOFs in the sequence, in order to highlight the content dependent nature of the results. The same graphs also plot the result of using no prediction for coding the MVs, and spatial prediction. The resulting bits needed for the coding are summarized in the table of FIG. 7.

As expected, due to the greater temporally correlated motion in the Coastguard video sequence of FIGS. 5A and 5B, there are larger savings in bits. It is important to realize the content dependent nature of these results. For instance, near the end of the Foreman video sequence of FIGS. 6A and 6B, the motion is very small, and is spatially very well correlated. This leads to very good performance by the spatial predictive coding of MVs. Also, during the sudden camera motion in the Coastguard video sequence, around GOF 5, spatial and temporal prediction of motion does not provide many gains.

Because the top down prediction and coding embodiment of the method of the present invention realizes bit-rate savings, this embodiment of the present invention may also be utilized during the motion estimation process. An example of this is shown in FIG. 8.

After considering different search range sizes after prediction, it was observed that this can provide interesting tradeoffs between the bit-rate, the quality, and the complexity of the estimation. The table of FIG. 9 summarizes the results of different search-size windows around the temporal prediction location (the temporal prediction is used as the search center).
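Using the temporal prediction as the search center can be sketched as a small full-search block matcher. This is only an illustration under stated assumptions: frames are plain 2D lists of luma samples, the matching criterion is the sum of absolute differences (the source does not name a criterion), and the names `sad` and `search` are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def search(cur, ref, bx, by, n, center, radius):
    """Full search in a +/-radius window centred on `center`,
    where `center` is the temporally predicted MV."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w and
                      0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue  # skip candidates that leave the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best[1] if best else center


# Synthetic 12 x 12 frames: cur is ref displaced by (3, 2).
ref = [[17 * y + x for x in range(12)] for y in range(12)]
cur = [[ref[(y + 2) % 12][(x + 3) % 12] for x in range(12)] for y in range(12)]

# A +/-1 window around a good prediction (2, 2) contains the true displacement.
good = search(cur, ref, 4, 4, 2, center=(2, 2), radius=1)
# The same small window around a poor prediction (0, 0) cannot reach it.
poor = search(cur, ref, 4, 4, 2, center=(0, 0), radius=1)
```

The second call shows the tradeoff in miniature: a small window keeps the result close to the prediction, which is cheap to code but only as good as the prediction itself.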

The No prediction for the ME (motion estimation) row corresponds to the results in the table of FIG. 7. As expected, due to the greater temporally correlated motion in the Coastguard video sequence, there are larger savings in MV bits. As may be seen by comparing other rows to the ‘No pred for MV’ row, temporal MV prediction during estimation helps in reducing the MV bits further. This reduction in MV bits allows more bits for the texture, and thus higher PSNR when the motion is temporally correlated. With increasing range after prediction, the quality of the matches improves, so although the bits for MV increase, the PSNR actually improves. It must be mentioned that the results vary from GOF to GOF, depending on the content and the nature of the motion. For some GOFs improvements have been observed in PSNR of up to 0.4 dB, or MV bit savings over spatial prediction of up to 12%.

One of the disadvantages of using the top down prediction and coding embodiment is the fact that all the motion vectors need to be decoded before the temporal recomposition. So MV1 and MV2 need to be decoded before MV3 can be decoded, and level 1 can be recomposed. This is unfavorable for temporal scalability, where some of the higher levels need to be decoded independently.

The top down prediction and coding embodiment may easily be used for coding MVs within the lifting framework, where motion estimation at higher temporal levels is performed on filtered frames. However, the gains of differential MV coding are likely to be smaller, due to the temporal averaging used to create the L-frames. Firstly, temporal averaging leads to some smoothing and smearing of objects in the scene. Also, when good matches cannot be found, some undesirable artifacts are created. In this case, using the motion vectors between unfiltered frames to predict the motion vectors between average frames, or vice versa, might lead to poor predictions. This can cause reduced efficiency of the motion vector coding.

Referring now to the flow chart of FIG. 10, there is shown a bottom-up prediction and coding embodiment of the method of the present invention. In this embodiment, the MVs at level 1 are used to predict the MVs at level 0 and so on. Using the simplified example of FIG. 3 again, step 300 includes determining MV3. MV3 may be determined conventionally by the motion estimation unit 132, at level 1 during motion estimation. During motion estimation, groups of pixels or regions in the succeeding A-frame from level 0 are matched to similar groups of pixels or regions in the previous A-frame from level 0. In step 310, MV1 and MV2 for level 0 are each estimated or predicted as a refinement based on MV3. The estimate for MV1 is an estimate of the groups of pixels or regions in the H-frame which match similar groups of pixels or regions in the previous A-frame. The estimate for MV2 is an estimate of the groups of pixels or regions in the H-frame that match similar groups of pixels or regions in the succeeding A-frame. The estimation of MV1 may be obtained by calculating the difference between MV3 and MV2. The estimation of MV2 may be obtained by calculating the difference between MV3 and MV1. In step 320, the entropy encoding unit 160 (FIG. 2) entropy encodes MV3. The method may end here or optionally in step 330, the entropy encoding unit 160 may also encode the refinements for MV1 and/or MV2.

The bottom-up prediction and coding embodiment produces temporally hierarchical motion vectors that may be used progressively at different levels of the temporal decomposition scheme. So MV3 can be used to recompose Level 1 without having to decode MV2 and MV1. Also, since MV3 is now more important than MV2 and MV1, as with the temporally decomposed frames, it may easily be combined with unequal error protection (UEP) schemes to produce more robust bitstreams. This can be beneficial especially in low bit-rate scenarios. However, the prediction scheme is likely to be less efficient than the top-down embodiment described previously. This is because MV3 is likely to be inaccurate (due to the larger distance between the source and the reference frame) and the use of an inaccurate prediction can lead to increased bits. As in the top-down embodiment, experiments were performed on the Foreman and Coastguard video sequences at the same resolutions and the same motion estimation parameters. The results are presented in FIGS. 11A, 11B, 12A, and 12B to show the gains of temporal prediction for coding alone (no prediction during motion estimation). The results of this are summarized in the table of FIG. 13.
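The difference in temporal scalability between the two embodiments comes down to which MVs must be decoded before level 1 can be recomposed. A minimal sketch of that dependency follows, using the three-MV example of FIG. 3. The table `DEPENDS_ON` and the function `mvs_to_decode` are hypothetical names, and the dependencies are a simplification of the prediction relationships described above.

```python
# Which MVs each MV is predicted from, per embodiment (FIG. 3 example).
DEPENDS_ON = {
    "top_down": {"MV1": [], "MV2": [], "MV3": ["MV1", "MV2"]},
    "bottom_up": {"MV3": [], "MV1": ["MV3"], "MV2": ["MV3"]},
}


def mvs_to_decode(scheme, wanted):
    """Return, in decoding order, every MV that must be decoded to obtain `wanted`."""
    order = []

    def visit(mv):
        for dep in DEPENDS_ON[scheme][mv]:
            visit(dep)          # predictors must be decoded first
        if mv not in order:
            order.append(mv)

    for mv in wanted:
        visit(mv)
    return order


# MVs needed to recompose level 1 only (i.e. to obtain MV3).
level1_top_down = mvs_to_decode("top_down", ["MV3"])
level1_bottom_up = mvs_to_decode("bottom_up", ["MV3"])
```

With the top-down dependencies, obtaining MV3 pulls in MV1 and MV2 as well; with the bottom-up dependencies, MV3 stands alone, which is what allows level 1 to be recomposed independently.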

As expected, the prediction results are not as good as in the top-down embodiment, and there is a significant degradation in performance, especially for GOFs where the motion is not temporally correlated. From FIGS. 11A and 11B, it can be seen that the temporal prediction performs extremely poorly for GOF 5 of the Coastguard video sequence. This is because around GOF 5 there is a sudden camera motion and the resulting motion has low temporal correlation. It should be re-emphasized that these results are content dependent, and that the decision to use temporal filtering may be turned on and off adaptively.

Some of the above experiments were repeated using the bottom-up embodiment during motion estimation, the results of which are summarized in the table of FIG. 14. As can be seen, the results are not as good as the results for the top-down prediction embodiment. More interestingly, however, looking at the results for the Coastguard video sequence, it can be seen that the number of bits for MVs after temporal prediction decreases with increasing window size. This might appear counter-intuitive; however, it may be explained as follows. When the temporal prediction is bad, then a small search window limits the result to be close to this poor prediction, instead of allowing the finding of a more accurate prediction. Although this small distance from the prediction results in fewer bits to code at the current level, not having a good prediction for the next (earlier) temporal level can significantly degrade the performance. This is actually clearly indicated by the results in the table of FIG. 15. All these results are from a 16 frame GOF with 4 levels of temporal decomposition. MV bits are shown for 5 frames: frame 8 that is filtered at level 3, frames 4 and 12 that are filtered at level 2, and frames 2 and 6 that are filtered at level 1. MVs of frame 8 are used to predict MVs of frames 4 and 12, and MVs of frame 4 are used to predict MVs of frames 2 and 6.

For frame 8, there is no temporal prediction, so the number of bits is the same in both cases. The number of bits is smaller for the ±4 window for frames 4 and 12, due to the smaller window size. However, the fact that this results in poor prediction for the frames at level 1 is indicated by the fact that the MV bits from frame 6 are much smaller for the ±16 window size. In fact, all the savings at level 2 are completely negated at level 1. However, when the motion is temporally correlated, then the use of this scheme can result in bit rate savings as well as improved PSNR.

An interesting extension of the idea to improve the results is possible. Since the predictions are desired to be as accurate as possible, a large window size may be used at level 3 and then decreased across the lower levels. For instance, a ±64 window size may be used at levels 3 and 2, and then decreased to a ±16 window size at level 1. This can lead to reduced bits along with improved PSNR.

All of the above discussion is for the UMCTF framework, where the motion estimation is performed on the original frames at all temporal levels. Adapting the above schemes for a lifting-based implementation, where motion estimation is performed at higher temporal levels on filtered L frames, may be difficult. The earlier described top-down embodiment can be adapted without difficulties, and it is expected that the results will be slightly better than for UMCTF, since the L frames are computed by taking into account the motion vectors estimated at lower temporal levels. However, for the bottom-up embodiment, some difficulties may be encountered, especially causality problems.

As shown in FIG. 16, in order to perform the bottom-up prediction embodiment during motion estimation, MV3 needs to be used to predict MV1 and MV2. However, if the estimation for MV3 needs to be performed on the filtered L frames, then MV1 and MV2 already need to have been estimated. This is because they are used during the creation of the L frames. So MV3 could not have been used for prediction during the estimation of MV1 and MV2. If instead, the motion estimation for MV3 is performed on unfiltered frames (i.e. the original frames), then bottom-up prediction during estimation can be used. However, the gains are likely to be smaller than for the UMCTF scheme. Of course, the bottom-up prediction embodiment can be used during the coding of the motion vectors (with no prediction during the estimation); however, as mentioned with regard to the top-down embodiment, there may exist some mismatch between the motion vectors at different levels.

Referring now to the flow chart of FIG. 17, there is shown a mixed, hybrid prediction and coding embodiment of the method of the present invention. In this embodiment, instead of using MVs from one decomposition level to predict MVs from other levels, a combination of MVs from different levels is used to predict other MVs. For example, a higher level MV(s) and forward MV(s) from the current level may be used to predict a backward MV(s). Using the simplified example of FIG. 3 again, step 400 includes determining MV1 and MV3, both of which may be determined conventionally by the motion estimation unit 132, at level 0 (MV1) and level 1 (MV3) during motion estimation. In step 410, MV2 for level 0 is estimated or predicted as a refinement based on MV1 and MV3. The estimation of MV2 may be obtained by calculating the difference between MV1 and MV3. In step 420, the entropy encoding unit 160 (FIG. 2) entropy encodes MV1 and MV3. The method may end here or optionally in step 430, the entropy encoding unit 160 may also encode the refinements for MV2.
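The hybrid steps 400 to 430 can be sketched on whole MV fields rather than single vectors. As before, this is an illustration under assumptions rather than the actual encoder: an MV field is a list of per-block (dx, dy) pairs, MV1 and MV2 point from the H-frame to the previous and following A-frames, MV3 points from the following A-frame to the previous A-frame, and the predictor for MV2 is therefore taken as MV1 - MV3. The function names are hypothetical.

```python
def predict_mv2_field(mv1_field, mv3_field):
    """Hybrid predictor: per-block difference MV1 - MV3 (sign convention assumed)."""
    return [(a[0] - c[0], a[1] - c[1]) for a, c in zip(mv1_field, mv3_field)]


def encode_refinements(mv2_field, mv1_field, mv3_field):
    """Per-block refinements that would be entropy coded in place of MV2."""
    pred = predict_mv2_field(mv1_field, mv3_field)
    return [(m[0] - p[0], m[1] - p[1]) for m, p in zip(mv2_field, pred)]


def decode_mv2_field(refinements, mv1_field, mv3_field):
    """Decoder side: rebuild MV2 from the decoded MV1, MV3 and refinements."""
    pred = predict_mv2_field(mv1_field, mv3_field)
    return [(r[0] + p[0], r[1] + p[1]) for r, p in zip(refinements, pred)]


# Illustrative two-block fields (values are not from the experiments).
mv1_field = [(-3, 1), (0, 0)]
mv3_field = [(-7, 2), (1, 0)]
mv2_field = [(3, -1), (-1, 0)]

refinements = encode_refinements(mv2_field, mv1_field, mv3_field)
decoded = decode_mv2_field(refinements, mv1_field, mv3_field)
```

Because the decoder forms the same predictor from the already decoded MV1 and MV3, the round trip is lossless regardless of how good the prediction is; the prediction quality only affects how many bits the refinements cost.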

FIG. 18 shows an embodiment of a decoder which may be used for implementing the principles of the present invention, denoted by numeral 500. The decoder 500 includes an entropy decoding unit 510 for decoding the incoming bit-stream. During operation, the input bit-stream will be decoded according to the inverse of the entropy coding technique performed on the encoding side, which will produce wavelet coefficients that correspond to each GOF. Further, the entropy decoding produces the MVs including the MVs predicted in accordance with the present invention, and frame numbers that will be utilized later.

A significance decoding unit 520 is included in order to decode the wavelet coefficients from the entropy decoding unit 510 according to significance information. Therefore, during operation, the wavelet coefficients will be ordered according to the correct spatial order by using the inverse of the technique used on the encoder side. As can be further seen, a spatial recomposition unit 530 is also included to transform the wavelet coefficients from the significance decoding unit 520 into partially decoded frames. During operation, the wavelet coefficients corresponding to each GOF will be transformed according to the inverse of the wavelet transform performed on the encoder side. This will produce partially decoded frames that have been motion compensated temporally filtered according to the present invention.

As previously described, the motion compensated temporal filtering according to the present invention results in each GOF being represented by a number of H-frames and an A-frame, the H-frame being the difference between each frame in the GOF and the other frames in the same GOF, and the A-frame being either the first or last frame not processed by the motion estimation and temporal filtering on the encoder side. An inverse temporal filtering unit 540 is included to reconstruct the H-frames included in each GOF from the spatial recomposition unit 530, based on the MVs and frame numbers provided by the entropy decoding unit 510, by performing the inverse of the temporal filtering performed on the encoder side.

FIG. 19 shows an embodiment of a system in which the principles of the present invention may be implemented, denoted by numeral 600. By way of example, the system 600 may represent a television, a set-top box, a desktop, laptop or palmtop computer, a personal digital assistant (PDA), a video/image storage device such as a video cassette recorder (VCR), a digital video recorder (DVR), a TiVO device, etc., as well as portions or combinations of these and other devices. The system 600 includes one or more video sources 610, one or more input/output devices 620, a processor 630, a memory 640 and a display device 650.

The video/image source(s) 610 may represent, e.g., a television receiver, a VCR or other video/image storage device. The source(s) 610 may alternatively represent one or more network connections for receiving video from a server or servers over, e.g., a global computer communications network such as the Internet, a wide area network, a metropolitan area network, a local area network, a terrestrial broadcast system, a cable network, a satellite network, a wireless network, or a telephone network, as well as portions or combinations of these and other types of networks.

The input/output devices 620, processor 630 and memory 640 communicate over a communication medium 650. The communication medium 650 may represent, e.g., a bus, a communication network, one or more internal connections of a circuit, circuit card or other device, as well as portions and combinations of these and other communication media. Input video data from the source(s) 610 is processed in accordance with one or more software programs stored in memory 640 and executed by processor 630 in order to generate output video/images supplied to the display device 650.

In particular, the software programs stored in memory 640 may include the method of the present invention, as described previously. In this embodiment, the method of the present invention may be implemented by computer readable code executed by the system 600. The code may be stored in the memory 640 or read/downloaded from a memory medium such as a CD-ROM or floppy disk. In other embodiments, hardware circuitry may be used in place of, or in combination with, software instructions to implement the invention.

The temporal MV prediction across multiple levels of the temporal decomposition in the MCTF framework is necessary to efficiently code the additional sets of motion vectors that are generated within the UMCTF and lifting based MCTF frameworks. The MVs may be coded differentially, where the estimation process uses no prediction, or when the estimation also uses temporal prediction. Although the top-down embodiment is more efficient, it does not support temporal scalability as the bottom-up embodiment does. When the motion is temporally correlated, the use of these schemes can reduce the MV bits by around 5-13% over no prediction and by around 3-5% over spatial prediction. Due to this reduction in MV bits, more bits can be allocated to the texture coding, and hence the resulting PSNR improves. PSNR improvements of around 0.1-0.2 dB at 50 Kbps have been observed for QCIF sequences. Importantly, the results indicate a great content dependence. In fact, for GOFs with temporally correlated motion, such schemes can significantly reduce the MV bits, and can improve the PSNR by up to 0.4 dB. Thus, the method of the invention can be used adaptively, based on the content and the nature of motion. The improvements achieved with the present invention are likely to be more significant when multiple reference frames are used, due to the greater temporal correlation that can be exploited. When MV prediction is used during motion estimation, different tradeoffs can be made between the bit rate, the quality and the complexity of the motion estimation.

While the present invention has been described above in terms of specific embodiments, it is to be understood that the invention is not intended to be confined or limited thereto. Therefore, the present invention is intended to cover various structures and modifications thereof included within the spirit and scope of the appended claims.

1. A method for encoding a video, the method comprising the steps of: dividing (120) the video into a group of frames; temporally filtering (134) the frames to provide at least first and second temporal decomposition levels; determining (132, 200) at least two motion vectors from the first decomposition level; estimating (210) at least one motion vector on the second temporal decomposition level as a refinement of the at least two motion vectors from the first temporal decomposition level; and encoding (220) the at least two motion vectors from the first temporal decomposition level.
2. The method according to claim 1, further comprising the step of encoding (230) the estimated at least one motion vector of the second temporal decomposition level.
3. A method for encoding a video, the method comprising the steps of: dividing (120) the video into a group of frames; temporally filtering (134) the frames to provide at least first and second temporal decomposition levels; determining (132, 300) at least one motion vector from the second temporal decomposition level; estimating (310) at least two motion vectors on the first temporal decomposition level as a refinement of the at least one motion vector from the second temporal decomposition level; and encoding (320) the at least one motion vector from the second temporal decomposition level.
4. The method according to claim 3, further comprising the step of encoding (330) the estimated at least two motion vectors of the first temporal decomposition level.
5. A method for encoding a video, the method comprising the steps of: dividing (120) the video into a group of frames; temporally filtering (134) the frames to provide at least first and second temporal decomposition levels; determining (132, 400) at least one motion vector from the first temporal decomposition level and at least one motion vector from the second temporal decomposition level; estimating (410) at least a second motion vector of the first temporal decomposition level as a refinement of the at least one motion vector from the first temporal decomposition level and the at least one motion vector from the second temporal decomposition level; and encoding (420) the at least one motion vector from the first temporal decomposition level and the at least one motion vector from the second temporal decomposition level.
6. The method according to claim 5, further comprising the step of encoding (430) the estimated at least second motion vector of the first temporal decomposition level.
7. An apparatus for encoding a video comprising: means (120) for dividing the video into a group of frames; means (134) for temporally filtering the frames to provide at least first and second temporal decomposition levels; means (132, 200) for determining at least two motion vectors from the first temporal decomposition level; means (210) for estimating at least one motion vector on the second temporal decomposition level as a refinement of the at least two motion vectors from the first temporal decomposition level; and means (220) for encoding the at least two motion vectors from the first temporal decomposition level.
8. The apparatus according to claim 7, further comprising means (230) for encoding the estimated at least one motion vector of the second temporal decomposition level.
9. A memory medium for encoding a video comprising: code (120) for dividing the video into a group of frames; code (134) for temporally filtering the frames to provide at least first and second temporal decomposition levels; code (132, 200) for determining at least two motion vectors from the first temporal decomposition level; code (210) for estimating at least one motion vector on the second temporal decomposition level as a refinement of the at least two motion vectors from the first temporal decomposition level; and code (220) for encoding the at least two motion vectors from the first temporal decomposition level.
10. The memory medium according to claim 9, further comprising code (230) for encoding the estimated at least one motion vector of the second temporal decomposition level.
11. An apparatus for encoding a video comprising: means (120) for dividing the video into a group of frames; means (134) for temporally filtering the frames to provide at least first and second temporal decomposition levels; means (132, 300) for determining at least one motion vector from the second temporal decomposition level; means (310) for estimating at least two motion vectors on the first temporal decomposition level as a refinement of the at least one motion vector from the second temporal decomposition level; and means (320) for encoding the at least one motion vector from the second temporal decomposition level.
12. The apparatus according to claim 11, further comprising means (330) for encoding the estimated at least two motion vectors of the first temporal decomposition level.
13. A memory medium for encoding a video comprising: code (120) for dividing the video into a group of frames; code (134) for temporally filtering the frames to provide at least first and second temporal decomposition levels; code (132, 300) for determining at least one motion vector from the second temporal decomposition level; code (310) for estimating at least two motion vectors on the first temporal decomposition level as a refinement of the at least one motion vector from the second temporal decomposition level; and code (320) for encoding the at least one motion vector from the second temporal decomposition level.
14. The memory medium according to claim 13, further comprising code (330) for encoding the estimated at least two motion vectors of the first temporal decomposition level.
15. An apparatus for encoding a video comprising: means (120) for dividing the video into a group of frames; temporally filtering (134) the frames to provide at least first and second temporal decomposition levels; means (132, 400) for determining at least one motion vector from the first temporal decomposition level and at least one motion vector from the second temporal decomposition level; means (410) for estimating at least a second motion vector of the first temporal decomposition level as a refinement of the at least one motion vector from the first temporal decomposition level and the at least one motion vector from the second temporal decomposition level; and means (420) for encoding the at least one motion vector from the first temporal decomposition level and the at least one motion vector from the second temporal decomposition level.
16. The apparatus according to claim 15, further comprising means (430) for encoding the estimated at least second motion vector of the first temporal decomposition level.
17. A memory medium for encoding a video comprising: code (120) for dividing the video into a group of frames; code (132, 400) for determining at least one motion vector from the first temporal decomposition level and at least one motion vector from the second temporal decomposition level; code (410) for estimating at least a second motion vector of the first temporal decomposition level as a refinement of the at least one motion vector from the first temporal decomposition level and the at least one motion vector from the second temporal decomposition level; and code (420) for encoding the at least one motion vector from the first temporal decomposition level and the at least one motion vector from the second temporal decomposition level.
18. The memory medium according to claim 17, further comprising code (430) for encoding the estimated at least second motion vector of the first temporal decomposition level.